Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

community: FAISS vectorstore - consistent Document id field #27101

Closed
wants to merge 669 commits into from

Conversation

nhols
Copy link
Contributor

@nhols nhols commented Oct 4, 2024

  • Set id field of documents in a FAISS docstore to be consistent with values in index_to_docstore_id
  • Implement get_by_ids method for FAISS vectorstore
  • Add tests

Copy link

vercel bot commented Oct 4, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Visit Preview Dec 15, 2024 4:58pm

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. community Related to langchain-community Ɑ: vector store Related to vector store module labels Oct 4, 2024
prithvikannan and others added 26 commits November 21, 2024 20:03
…8274)

Thank you for contributing to LangChain!

Ctrl+F to find instances of `langchain-databricks` and replace with
`databricks-langchain`.

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

Signed-off-by: Prithvi Kannan <[email protected]>
- **docs: poetry publish**
- **x**
- **x**
- **x**
- **x**
- **x**
- **x**
- **x**
- **x**
- **x**
…ain-ai#28269)

- **Description:** Corrected the parameter name in the
HuggingFaceEmbeddings documentation under integrations/text_embedding/
from model to model_name to align with the actual code usage in the
langchain_huggingface package.
- **Issue:** Fixes langchain-ai#28231
- **Dependencies:** None
…ssages` (langchain-ai#28267)

We have a test
[test_structured_few_shot_examples](https://github.com/langchain-ai/langchain/blob/ad4333ca032033097c663dfe818c5c892c368bd6/libs/standard-tests/langchain_tests/integration_tests/chat_models.py#L546)
in standard integration tests that implements a version of tool-calling
few shot examples that works with ~all tested providers. The formulation
supported by ~all providers is: `human message, tool call, tool message,
AI reponse`.

Here we update
`langchain_core.utils.function_calling.tool_example_to_messages` to
support this formulation.

The `tool_example_to_messages` util is undocumented outside of our API
reference. IMO, if we are testing that this function works across all
providers, it can be helpful to feature it in our guides. The structured
few-shot examples we document at the moment require users to implement
this function and can be simplified.
langchain-ai#28296)

**Description:**

Currently, the docstring for `LanceDB.__init__()` provides the default
value for `mode`, but not the list of valid values. This PR adds that
list to the docstring.

**Issue:**

N/A

**Dependencies:**

N/A

**Twitter handle:**

`@metadaddy`

[Leaving as a reminder: If no one reviews your PR within a few days,
please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda,
hwchase17.]
pydantic 2.10 compat for langchain-core
fix small GOOGLE_API_KEY markdown formatting typo
Adds deprecation notices for Neo4j components moving to the
`langchain_neo4j` partner package.

- Adds deprecation warnings to all Neo4j-related classes and functions
that have been migrated to the new `langchain_neo4j` partner package
- Updates documentation to reference the new `langchain_neo4j` package
instead of `langchain_community`
pprados and others added 8 commits December 13, 2024 21:24
)

JSONparse, in _validate_metadata_func(), checks the consistency of the
_metadata_func() function. To do this, it invokes it and makes sure it
receives a dictionary in response. However, during the call, it does not
respect future calls, as shown on line 100. This generates errors if,
for example, the function is like this:
```python
        def generate_metadata(json_node:Dict[str,Any],kwargs:Dict[str,Any]) -> Dict[str,Any]:
             return {
                "source": url,
                "row": kwargs['seq_num'],
                "question":json_node.get("question"),
            }
        loader = JSONLoader(
            file_path=file_path,
            content_key="answer",
            jq_schema='.[]',
            metadata_func=generate_metadata,
            text_content=False)
```
To avoid this, the verification must comply with the specifications.
This patch does just that.

---------

Co-authored-by: Eugene Yurtsev <[email protected]>
…i#25375)

community: add hybrid search in opensearch

# Langchain OpenSearch Hybrid Search Implementation

## Implementation of Hybrid Search: 

I have taken LangChain's OpenSearch integration to the next level by
adding hybrid search capabilities. Building on the existing
OpenSearchVectorSearch class, I have implemented Hybrid Search
functionality (which combines the best of both keyword and semantic
search). This new functionality allows users to harness the power of
OpenSearch's advanced hybrid search features without leaving the
familiar LangChain ecosystem. By blending traditional text matching with
vector-based similarity, the enhanced class delivers more accurate and
contextually relevant results. It's designed to seamlessly fit into
existing LangChain workflows, making it easy for developers to upgrade
their search capabilities.

In implementing the hybrid search for OpenSearch within the LangChain
framework, I also incorporated filtering capabilities. It's important to
note that according to the OpenSearch hybrid search documentation, only
post-filtering is supported for hybrid queries. This means that the
filtering is applied after the hybrid search results are obtained,
rather than during the initial search process.

**Note:** For the implementation of hybrid search, I strictly followed
the official OpenSearch Hybrid search documentation and I took
inspiration from
https://github.com/AndreasThinks/langchain/tree/feature/opensearch_hybrid_search
Thanks Mate!  

### Experiments

I conducted few experiments to verify that the hybrid search
implementation is accurate and capable of reproducing the results of
both plain keyword search and vector search.

Experiment - 1
Hybrid Search
Keyword_weight: 1, vector_weight: 0

I conducted an experiment to verify the accuracy of my hybrid search
implementation by comparing it to a plain keyword search. For this test,
I set the keyword_weight to 1 and the vector_weight to 0 in the hybrid
search, effectively giving full weightage to the keyword component. The
results from this hybrid search configuration matched those of a plain
keyword search, confirming that my implementation can accurately
reproduce keyword-only search results when needed. It's important to
note that while the results were the same, the scores differed between
the two methods. This difference is expected because the plain keyword
search in OpenSearch uses the BM25 algorithm for scoring, whereas the
hybrid search still performs both keyword and vector searches before
normalizing the scores, even when the vector component is given zero
weight. This experiment validates that my hybrid search solution
correctly handles the keyword search component and properly applies the
weighting system, demonstrating its accuracy and flexibility in
emulating different search scenarios.


Experiment - 2
Hybrid Search
keyword_weight = 0.0, vector_weight = 1.0

For experiment-2, I took the inverse approach to further validate my
hybrid search implementation. I set the keyword_weight to 0 and the
vector_weight to 1, effectively giving full weightage to the vector
search component (KNN search). I then compared these results with a pure
vector search. The outcome was consistent with my expectations: the
results from the hybrid search with these settings exactly matched those
from a standalone vector search. This confirms that my implementation
accurately reproduces vector search results when configured to do so. As
with the first experiment, I observed that while the results were
identical, the scores differed between the two methods. This difference
in scoring is expected and can be attributed to the normalization
process in hybrid search, which still considers both components even
when one is given zero weight. This experiment further validates the
accuracy and flexibility of my hybrid search solution, demonstrating its
ability to effectively emulate pure vector search when needed while
maintaining the underlying hybrid search structure.



Experiment - 3
Hybrid Search - balanced

keyword_weight = 0.5, vector_weight = 0.5

For experiment-3, I adopted a balanced approach to further evaluate the
effectiveness of my hybrid search implementation. In this test, I set
both the keyword_weight and vector_weight to 0.5, giving equal
importance to keyword-based and vector-based search components. This
configuration aims to leverage the strengths of both search methods
simultaneously. By setting both weights to 0.5, I intended to create a
scenario where the hybrid search would consider lexical matches and
semantic similarity equally. This balanced approach is often ideal for
many real-world applications, as it can capture both exact keyword
matches and contextually relevant results that might not contain the
exact search terms.

Kindly verify the notebook for the experiments conducted!  

**Notebook:**
https://github.com/karthikbharadhwajKB/Langchain_OpenSearch_Hybrid_search/blob/main/Opensearch_Hybridsearch.ipynb

### Instructions to follow for Performing Hybrid Search:

**Step-1: Instantiating OpenSearchVectorSearch Class:**
```python
opensearch_vectorstore = OpenSearchVectorSearch(
    index_name=os.getenv("INDEX_NAME"),
    embedding_function=embedding_model,
    opensearch_url=os.getenv("OPENSEARCH_URL"),
    http_auth=(os.getenv("OPENSEARCH_USERNAME"),os.getenv("OPENSEARCH_PASSWORD")),
    use_ssl=False,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False
)
```

**Parameters:**
1. **index_name:** The name of the OpenSearch index to use.
2. **embedding_function:** The function or model used to generate
embeddings for the documents. It's assumed that embedding_model is
defined elsewhere in the code.
3. **opensearch_url:** The URL of the OpenSearch instance.
4. **http_auth:** A tuple containing the username and password for
authentication.
5. **use_ssl:** Set to False, indicating that the connection to
OpenSearch is not using SSL/TLS encryption.
6. **verify_certs:** Set to False, which means the SSL certificates are
not being verified. This is often used in development environments but
is not recommended for production.
7. **ssl_assert_hostname:** Set to False, disabling hostname
verification in SSL certificates.
8. **ssl_show_warn:** Set to False, suppressing SSL-related warnings.

**Step-2: Configure Search Pipeline:**

To initiate hybrid search functionality, you need to configures a search
pipeline first.

**Implementation Details:**

This method configures a search pipeline in OpenSearch that:
1. Normalizes the scores from both keyword and vector searches using the
min-max technique.
2. Applies the specified weights to the normalized scores.
3. Calculates the final score using an arithmetic mean of the weighted,
normalized scores.


**Parameters:**

* **pipeline_name (str):** A unique identifier for the search pipeline.
It's recommended to use a descriptive name that indicates the weights
used for keyword and vector searches.
* **keyword_weight (float):** The weight assigned to the keyword search
component. This should be a float value between 0 and 1. In this
example, 0.3 gives 30% importance to traditional text matching.
* **vector_weight (float):** The weight assigned to the vector search
component. This should be a float value between 0 and 1. In this
example, 0.7 gives 70% importance to semantic similarity.

```python
opensearch_vectorstore.configure_search_pipelines(
    pipeline_name="search_pipeline_keyword_0.3_vector_0.7",
    keyword_weight=0.3,
    vector_weight=0.7,
)
```

**Step-3: Performing Hybrid Search:**

After creating the search pipeline, you can perform a hybrid search
using the `similarity_search()` method (or) any methods that are
supported by `langchain`. This method combines both `keyword-based and
semantic similarity` searches on your OpenSearch index, leveraging the
strengths of both traditional information retrieval and vector embedding
techniques.

**parameters:**
* **query:** The search query string.
* **k:** The number of top results to return (in this case, 3).
* **search_type:** Set to `hybrid_search` to use both keyword and vector
search capabilities.
* **search_pipeline:** The name of the previously created search
pipeline.

```python
query = "what are the country named in our database?"

top_k = 3

pipeline_name = "search_pipeline_keyword_0.3_vector_0.7"

matched_docs = opensearch_vectorstore.similarity_search_with_score(
                query=query,
                k=top_k,
                search_type="hybrid_search",
                search_pipeline = pipeline_name
            )

matched_docs
```

twitter handle: @iamkarthik98

---------

Co-authored-by: Karthik Kolluri <[email protected]>
Co-authored-by: Eugene Yurtsev <[email protected]>
…gchain-ai#28374)

**Description:** This PR introduces a `model` alias for the embedding
classes that contain the attribute `model_name`, to ensure consistency
across the codebase, as suggested by a moderator in a previous PR. The
change aligns the usage of attribute names across the project (see for
example
[here](https://github.com/langchain-ai/langchain/blob/65deeddd5dfec5d51f33ebc961f09c2e47a8f064/libs/partners/groq/langchain_groq/chat_models.py#L304)).
**Issue:** This PR addresses the suggestion from the review of issue
langchain-ai#28269.
**Dependencies:**  None

---------

Co-authored-by: Eugene Yurtsev <[email protected]>
Co-authored-by: Erick Friis <[email protected]>
…rkdownifyTransformer` (langchain-ai#27866)

# Description
Implements the `atransform_documents` method for
`MarkdownifyTransformer` using the `asyncio` built-in library for
concurrency.

Note that this is mainly for API completeness when working with async
frameworks rather than for performance, since the `markdownify` function
is not I/O bound because it works with `Document` objects already in
memory.

# Issue
Fixes langchain-ai#27865

# Dependencies
No new dependencies added, but
[`markdownify`](https://github.com/matthewwithanm/python-markdownify) is
required since this PR updates the `markdownify` integration.

# Tests and docs
- Tests added
- I did not modify the docstrings since they already described the basic
functionality, and [the API docs also already included a
description](https://python.langchain.com/api_reference/community/document_transformers/langchain_community.document_transformers.markdownify.MarkdownifyTransformer.html#langchain_community.document_transformers.markdownify.MarkdownifyTransformer.atransform_documents).
If it would be helpful, I would be happy to update the docstrings and/or
the API docs.

# Lint and test
- [x] format
- [x] lint
- [x] test

I ran formatting with `make format`, linting with `make lint`, and
confirmed that tests pass using `make test`. Note that some unit tests
pass in CI but may fail when running `make_test`. Those unit tests are:
- `test_extract_html` (and `test_extract_html_async`)
- `test_strip_tags` (and `test_strip_tags_async`)
- `test_convert_tags` (and `test_convert_tags_async`)

The reason for the difference is that there are trailing spaces when the
tests are run in the CI checks, and no trailing spaces when run with
`make test`. I ensured that the tests pass in CI, but they may fail with
`make test` due to the addition of trailing spaces.

---------

Co-authored-by: Erick Friis <[email protected]>
Thank you for contributing to LangChain!

 **PR title**: "community: fix  PDF Filter Type Error"


  - **Description:** fix  PDF Filter Type Error"
  - **Issue:** the issue langchain-ai#27153 it fixes,
  - **Dependencies:** no
- **Twitter handle:** if your PR gets announced, and you'd like a
mention, we'll gladly shout you out!



- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

---------

Co-authored-by: Erick Friis <[email protected]>
@efriis
Copy link
Member

efriis commented Dec 14, 2024

hey there! I believe this conflicted with a change adding a filtering syntax to FAISS. Would you be interested in re-implementing just the get_by_ids function based on those filters? Otherwise will probably close for now!

@efriis efriis self-assigned this Dec 14, 2024
ronidas39 and others added 13 commits December 14, 2024 00:21
Added Langchain complete tutorial playlist from total technology zonne
channel .In this playlist every video is focusing one specific use case
and hands on demo.All tutorials are equally good for every levels .

Thank you for contributing to LangChain!

- [ ] **PR title**: "package: description"
- Where "package" is whichever of langchain, community, core, etc. is
being modified. Use "docs: ..." for purely docs changes, "infra: ..."
for CI changes.
  - Example: "community: add foobar LLM"


- [ ] **PR message**: ***Delete this entire checklist*** and replace
with
    - **Description:** a description of the change
    - **Issue:** the issue # it fixes, if applicable
    - **Dependencies:** any dependencies required for this change
- **Twitter handle:** if your PR gets announced, and you'd like a
mention, we'll gladly shout you out!


- [ ] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory.


- [ ] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

---------

Co-authored-by: Erick Friis <[email protected]>
Co-authored-by: Erick Friis <[email protected]>
Co-authored-by: Jesse Schumacher <[email protected]>
Co-authored-by: Jesse S <[email protected]>
Co-authored-by: dylan <[email protected]>
Issue: Here is an ambiguity about W&B integrations. There are two
existing provider pages.
Fix: Added the "root" W&B provider page. Added there the references to
the documentation in the W&B site. Cleaned up formats in existing pages.
Added one more integration reference.

---------

Co-authored-by: Erick Friis <[email protected]>
Co-authored-by: Eugene Yurtsev <[email protected]>
- **Description:** Adds a helper that renders documents with the
GraphVectorStore metadata fields to Graphviz for visualization. This is
helpful for understanding and debugging.

---------

Co-authored-by: Erick Friis <[email protected]>
Thank you for contributing to LangChain!

- [x] **PR title**: langchain: add URL parameter to ChatDeepInfra class

- [x] **PR message**: add URL parameter to ChatDeepInfra class
- **Description:** This PR introduces a url parameter to the
ChatDeepInfra class in LangChain, allowing users to specify a custom
URL. Previously, the URL for the DeepInfra API was hardcoded to
"https://stage.api.deepinfra.com/v1/openai/chat/completions", which
caused issues when the staging endpoint was not functional. The _url
method was updated to return the value from the url parameter, enabling
greater flexibility and addressing the problem. out!

---------

Co-authored-by: Erick Friis <[email protected]>
… values in index_to_docstore_id, implement get_by_ids method
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Dec 15, 2024
@nhols nhols closed this Dec 15, 2024
@nhols
Copy link
Contributor Author

nhols commented Dec 15, 2024

@efriis opened a new PR! #28728

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community Related to langchain-community size:XXL This PR changes 1000+ lines, ignoring generated files. Ɑ: vector store Related to vector store module
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.